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1.  Introduction 


Consider  ths  large  spsrss  eyelets  of  linear  equations 

Ax  ■ b (1.1) 


where  A la  an  N«N  sparse  nonsysmetrlc  nstrlx  and  x and  b are  vectors  of  length  N.  Aasune  that 
A can  be  factored  In  the  forn 

A - LOU  (1.2) 

where  L la  a unit  lower  triangular  matrix,  D a nonsingular  diagonal  matrix,  and  U a unit  upper 
triangular  matrix.  Then  an  Important  method  for  solving  (1.1)  Is  sparse  Gaussian  elimination, 
or,  equivalently,  first  factoring  A as  in  (1.2)  and  then  successively  solving  the  systems 

Ly  • b,  0*  - y,  Ux  - x.  (1.3) 

Recently,  several  Implementations  of  sparse  Gaussian  elimination  have  been  developed  to 
solve  systems  like  (1.1)  (cf.  Curtis  and  Reid  [2],  Elsenstat,  Schultz,  and  Sherman  [5], 

Guatavson  (6],  and  Rhelnboldt  and  Mesztenyl  [7]).  The  basic  Idea  of  all  of  these  is  to  factor  A 
and  compute  x without  storing  or  operating  on  zeroes  In  A,  L,  or  U.  Doing  this  requires  a 
certain  amount  of  storage  and  operational  overhead,  l.e.  extra  storage  for  pointers  In  addition 
to  that  needed  for  nonzeroes  and  extra  nonnumeric  "bookkeeping"  operations  In  addition  to  the 
required  arithmetic  operations.  All  these  Implementations  of  sparse  Cauaslan  elimination 
generate  the  same  factorization  of  A and  avoid  storing  and  operating  on  zeroes.  Thus,  they  all 
have  the  same  costs  as  msssured  in  terms  of  the  mmfcer  of  nonzeroes  in  L and  U or  the  number  of 
arithmetic  operations  performed.  The  Implementations  do,  however,  have  different  overhead 
requirements,  and  thus  their  total  storage  and  tins  requirements  vary  a great  deal. 

In  this  paper £e* discuss*  the  design  of  sparse  Gaussian  elimination  codes'^  We  are 
particular  ^interested  ln^the  effects  of  certain  flexibility  and  cost  constraints  on  the 
design,  and 'w^exailna-*  possible  tradeoffs  among  the  design  goals  of  flexibility,  speed,  end 
small  slse.^ 

— 

la  (action  2 we  describe  a basic  design  due  to  Chang  [1],  which  has  been  used 
affectively  In  the  Implementations  referred  to  above.  Next,  in  Section  3 we  discuss  the 
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vtottgc  of  apart*  matrices  and  present  two  storage  schemes  for  use  with  spars*  Causslan 
elimination.  In  Section  4 we  describe  three  specific  lnplensntatlons  that  Illustrate  the  range 
of  possible  tradeoffs  anong  design  goal*.  Finally,  In  Section  5 we  give  sone  quantitative 
cooper Isons  of  the  three  lapleaentatlona. 


2.  A Basic  Implementation  Design 

Several  years  ago  Chang  [1]  suggested  a design  for  the  lnplenentatlon  of  sparse  Causslan 
elimination  that  has  proved  to  be  particularly  robust.  He  proposed  breaking  the  conputatlon  up 
Into  three  distinct  steps:  symbolic  factorization  (SYMFAC).  numeric  factorization  (NUMFAC), 

and  forward-  and  back-solution  (SOLVE).  The  SYMFAC  step  computes  the  zero  structures  of  L and 
D (i.e.  the  positions  of  the  nonzeroes  in  L and  U)  from  that  of  A,  disregarding  the  actual 
numerical  entries  of  A.  The  NUMFAC  step  then  uses  the  structure  information  generated  by 
STMFAC  to  compute  the  numerical  entries  of  L,  D,  and  U.  Finally,  the  SOLVE  step  uses  the 
numerical  factorization  generated  by  NUMFAC  to  solve  the  system  (1.3). 

The  main  advantage  of  splitting  up  the  computation  as  described  here  Is  flexibility.  If 
several  linear  systew  have  identical  coefficient  matrices  but  different  rlghthand  sides,  then 
only  one  SYMFAC  and  one  NUMFAC  are  needed;  the  different  rlghthand  sides  require  only  separate 
SOLVE  steps.  (This  situation  arises  In  the  use  of  the  chord  method  to  solve  a system  of 
nonlinear  equations.)  Similarly,  a sequence  of  linear  systems,  all  of  whose  coefficient 
matrices  have  Identical  zero  structure  but  different  numerical  entries,  can  be  solved  using 
Just  one  SYMFAC  confined  with  separate  NUMFAC  and  SOLVE  steps  for  each  system.  (This  situation 
arises  when  Newton's  method  Is  used  to  solve  a system  of  nonlinear  equations.) 

A drawback  to  the  three-step  design  la  that  It  Is  necessary  to  store  the  descriptions  of 
the  sero  structures  and  the  actual  numerical  entries  of  both  L and  0.  In  essence,  the  great 
flexibility  is  paid  for  with  a large  amount  of  extra  storage.  By  giving  up  some  flexibility. 

It  Is  possible  to  reduce  substantially  the  storage  requirements.  For  example,  combining  NUMFAC 
and  SOLVE  Into  a single  NUMSLV  step  eliminates  the  need  for  storing  the  numerical  entries  of  L. 
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It  Is  no  longer  possible,  however,  to  handle  nultlple  right hand  sides  so  efficiently.  Then 
again.  If  all  three  ateps  are  combined  into  one  TKKSLV  step.  It  la  unnecessary  to  store  even  a 
description  of  the  zero  structure  of  L.  But  by  combining  steps  In  this  way,  we  lose  the 
ability  to  solve  efficiently  sequences  of  systems  all  of  whose  coefficient  matrices  have 
Identical  zero  structure. 


3.  Storage  of  Sparse  Matrices 

In  this  section  we  describe  two  storage  schemes  that  can  be  used  to  store  A,  L,  and  U.  The 
schemes  are  designed  specifically  for  use  with  sparse  Gaussian  elimination  and  they  exploit  the 
fact  that  random  access  of  sparse  matrices  Is  not  required. 

He  call  the  first  storage  scheme  the  uncompressed  storage  scheme.  It  has  been  used 
previously  In  various  forms  by  Gustavson  [6]  and  Curtis  and  Reid  [2].  The  version  given  here 
Is  a row-oriented  scheme  In  which  nonzero  matrix  entries  are  atored  row  by  row,  although  a 
column-oriented  version  would  work  as  well.  Within  each  row,  nonzero  entries  are  stored  In 
order  of  Increasing  column  Index.  To  Identify  the  entries  of  any  row.  It  Is  necessary  to  know 
where  the  row  starts,  how  many  nonzero  entries  It  contains,  and  In  what  columis  the  nonzero 
entries  lie.  This  extra  Information  describes  the  zero  structure  of  the  matrix  and  Is  the 
storage  overhead  mentioned  earlier. 

Storing  the  matrix  A with  the  uncompressed  scheme  requires  three  arrays  (IA,  JA,  and  A), 
as  shown  In  Figure  3.1.  The  array  A contains  the  nonzero  entries  of  A stored  row  by  row.  IA 
contains  N+l  pointers  that  delimit  the  rows  of  nonzeroea  in  the  array  A — A(IA(I))  is  the  first 
stored  entry  of  the  1th  row.  Since  the  rows  are  stored  consecutively,  the  number  of  entries 
stored  for  the  1th  row  Is  given  by  IA(I+1)  - IA(I).  (IA(I+1)  Is  defined  so  that  this  holds  for 
the  H th  row.)  The  array  JA  contains  the  column  indices  that  correspond  to  the  nonzero  entries 
In  the  array  A — If  A(K)  contains  a^,  then  JA(K)  ■ J.  The  storage  overhead  Incurred  by  using 
the  uncompressed  storage  scheme  for  A Is  the  storage  for  IA  and  JA.  Since  IA  has  N+l  entries 
end  JA  has  one  entry  per  entry  of  the  array  A,  tha  storage  overhead  la  approximately  equal  to 
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Figure  3.1. 


the  number  of  nonzeroee  In  the  matrix  A. 

Previous  lap  lenient  at  Ions  of  sparse  Gaussian  elimination  have  also  used  variants  of  the 

uncompressed  storage  scheme  for  storing  L and  U,  as  shown  in  Figure  3.2.  Again  the  storage 

overhead  la  approximately  equal  to  the  number  of  nonzero  entries  In  the  two  matrices.  Storing 

L and  U In  this  way  has  the  advantage  that  the  operational  overhead  In  Implementation  is  quite 

email,  since  the  data  structures  are  simple  and  the  matrix  entries  can  be  accessed  quickly. 

In  certain  situations  where  storage  is  at  a premium,  however.  It  Is  essential  to  reduce 

the  storage  overhead,  even  If  the  operational  overhead  la  Increased.  This  can  be  done  by 

storing  L and  U with  a more  complex  compressed  storage  scheme  (cf.  Elsenstat,  Schultz,  and 

Sherman  [4],  Sherman  [8]).  The  compressed  storage  scheme  will  Incur  more  operational  overhead 

than  the  uncompressed  scheme  but  the  storage  requirement  will  be  substantially  reduced.  In  the 

T 

compressed  storage  scheme  L la  stored  by  columns  (or,  equivalently,  L la  stored  by  rows)  and  U 
tn  stored  by  rows.  Figure  3.3  Illustrates  the  derivation  of  the  compressed  storage  scheme  for 
0.  (The  derivation  for  L la  similar,  using  a column-oriented  scheme.) 

Figure  3.3a  shows  the  data  structures  required  to  store  U In  the  uncompressed  storage 
schema.  It  Is  Immediately  evident  that  the  diagonal  entries  do  not  need  to  be  stored,  since 
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Figure  3.2. 


they  ere  always  equal  to  1 and  occur  as  the  first  stored  entry  of  each  row. 

Figure  3.3b  shows  the  data  structures  required  when  the  diagonal  entries  are  omitted, 
tfe  now  note  that  the  Indices  In  JU  for  certain  rows  of  U are  actually  final  subsequences  of  the 
Indices  for  previous  rows.  For  example,  the  Indices  for  row  3 are  4,5,  while  those  for  row  2 
are  3,4,5.  Instead  of  storing  the  Indices  for  row  3 separately,  we  can  simply  make  use  of  the 
lest  two  Indices  stored  for  row  2.  All  that  Is  required  Is  a pointer  to  locate  the  Indices  for 
row  3. 

In  general,  the  indices  In  JU  can  be  compressed  by  deleting  the  indices  for  any  row  If 
they  already  appear  as  a final  subsequence  of  some  previous  row  (see  Figure  3.4).  It  Is 
possible  to  compress  the  Indices  In  certain  other  cases  as  well,  but  testa  have  shown  that  very 
little  Is  gained  by  so  doing. 

Since  the  compressed  indices  In  JU  do  not  correspond  directly  to  the  nonzeroes  stored  In 
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the  array  0,  an  extra  array  of  pointer*  (IJU)  la  required  to  locate  the  lndlcea  for  each  row 
(aee  Figure  3.3c).  Thua  the  atorage  overhead  for  the  compressed  storage  of  U la  the  nuaber  of 
locations  required  for  IU,  JU,  and  IJU.  Although  this  overhead  can  be  larger  than  that  with 
the  uncompressed  acheaa,  there  are  Important  caaea  In  which  it  la  eubetantlally  smaller  (cf. 
Section  6;  Elsenstat,  Schults,  and  Sherman  [3,4];  Sherman  [8]). 
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4.  Three  Implementation  Designs 


In  this  section  we  describe  three  specific  implementation  designs,  which  illustrate  some  of  the 
tradeoffs  mentioned  earlier.  Designs  other  than  these  three  can  also  be  derived,  but  these 
Indicate  the  broad  spectrum  of  Implementations  that  are  possible. 

The  first  Implementation  (SGE1)  Is  designed  for  speed.  It  uses  the  uncompressed  storage 
scheme  for  A,  L,  and  U because  of  the  smaller  operational  overhead  associated  with  It. 
Furthermore,  we  combine  the  NUKFAC  and  SOLVE  steps  to  avoid  saving  the  numeric  entries  of  L,  so 
that  the  computation  consists  of  a S YMF AC  step  follcved  by  the  NUMSLV  step. 

The  second  Implementation  (SCE2)  Is  designed  to  reduce  the  storage  requirements.  The 
entire  computation  Is  performed  in  a TRKSLV  step  to  avoid  storing  either  the  description  or  the 
numerical  entries  of  L.  Moreover,  U is  stored  with  the  compressed  storage  scheme  to  reduce  the 
storage  overhead.  This  design  Incurs  more  operational  overhead  than  SGE1;  the  total  storage 
requirements , however,  are  much  smaller. 

Finally,  the  third  implementation  (SGE3)  attempts  to  balance  the  design  goals  of  speed 
and  small  size.  It  splits  the  computation  as  In  SGE1  to  avoid  storing  the  numerical  entries  of 
L and  It  uaes  the  compressed  storage  scheme  as  In  SGE2  to  reduce  storage  overhead. 


5.  Quantitative  Comparisons 
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In  order  to  compare  the  designs  of  Section  4,  we  teeted  then  on  five-point  model  problems,  l.e. 
linear  systems  of  the  form 

An  * " b»  (5.1) 

where  A Is  a permutation  P A P ^ of  the  n*n  block  tridiagonal  matrix  A given  by 
n ana 


Here  the  blocks  B,  C,  and  D are  n*n  with  B tridiagonal  and  C and  D diagonal,  and  P is  an  n2*n2 

n 

permutation  matrix  chosen  to  reduce  the  number  of  nonzeroes  In  the  factors  of  A and  the  amount 

n 

of  arithmetic  required  to  solve  (5.1).  Systems  like  (5.1)  arise  frequently  in  the  numerical 
solution  of  partial  differential  equations  in  rectangular  domains  (cf.  Woo,  Elsenstat,  Schultz, 
and  Sherman  [9]  for  a specific  example.) 

Our  experiments  were  designed  to  answer  two  questions.  First,  what  are  the  storage  and 
CTO- time  costs  of  solving  the  model  systems  with  each  of  the  three  implementations?  Second, 
how  large  a model  system  can  each  of  the  Implementations  solve  In  a fixed  amount  of  core 
storage  (262,144  words)? 

The  results  shown  In  Tables  5.1  and  5.2  were  obtained  on  an  IBM  System/370  Model  158 


n 

SGE1 

SGE2 

SGE3 

20 

17,324 

12,789 

15,508 

25 

29,038 

20,971 

25,519 

30 

44,226 

30,909 

37,334 

40 

S3, 696 

57,653 

69,352 

50 

142,382 

94,021 

113,346 

n 

SGE1 

SGE2 

SGE3 

60 

214,788 

137,621 

164,364 

65 

257,510 

— 

194,547 

70 

>262,144 

— 

227,740 

75 

>262,144 

— 

>262,144 

80 

>262,144 

256,155 

>262,144 

Table  5.1. 

Storage  Requirements  for  the  Model  Problem. 


n 

SCE1 

(NUMSLV) 

SGE2 

(TRKSLV) 

SCE3 

(NUMSLV) 

n 

SCE1 

(NUMSLV) 

SCE2 

(TRKSLV) 

SCE3 

(NUMSLV) 

20 

.58 

1.15 

.71 

60 

18.04 

31.55 

20.31 

25 

1.14 

2.23 

1.40 

65 

— 

— 

25.73 

30 

2.05 

3.86 

2.45 

70 

— 

— 

31.86 

40 

5.00 

9.25 

5.91 

80 

— 

78.15 

— 

50 

10.22 

18.07 

11.52 

Table  5.2. 

Timings  for  the  Model  Problem  (In  seconds). 


using  the  FORTRAN  IV  Level  H Extended  compiler.*  For  n - 60,  SGE1  Is  the  fastest 
Implementation,  requiring  40-45Z  less  time  than  SGE2  and  10-15Z  less  time  than  SGE3.  On  the 
other  hand,  SGE2  requires  less  storage,  using  35-40Z  less  than  SGE1  and  15-20Z  less  than  SGE3. 
Furthermore,  we  see  that  SGE2  can  solve  larger  problems  (n  - 80)  than  either  SGE1  (n  *=  65)  or 
SGE3  (n  “ 70).  Evidently,  then,  the  qualitative  comparisons  suggested  In  Section  4 are  borne 
out  In  practice. 


6.  Conclusion 


In  this  paper  we  have  considered  the  design  of  Implementations  of  sparse  Gaussian  elimination 
In  terms  of  the  competing  goals  of  flexibility,  speed,  and  small  size.  We  have  seen  that  by 
varying  certain  aspects  of  the  design.  It  is  possible  to  vary  the  degree  to  which  each  of  these 
goals  Is  attalnsd.  Indeed,  there  seems  to  be  almost  a continuous  spectrum  of  possible  designs 
— SCE1  and  SGE2  are  Its  endpoints,  while  SGE3  Is  just  one  of  many  Intermediate  points.  There 
Is  no  single  Implementation  that  Is  always  best;  the  particular  Implementation  that  should  be 
used  in  a given  situation  depends  on  the  problems  to  be  solved  and  the  computational 


* We  are  Indebted  to  Dr.  P.  T.  Woo  of  the  Chevron  Oil  Field  Research  Company  for  running  these 
experiments  for  us. 
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IH* 

environment  la  which  the  calculation*  are  to  be  performed. 
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